Every time a dataset is created, either for data management purposes or for statistical analyses, it is imperative that each variable be reviewed. Not only should the evaluation provide summary statistics and graphical displays to detect data errors, it should also present the results in a thorough but succinct manner. To accomplish this goal, descriptive summaries should be created for each variable according to its characteristics.
The best available option for generating descriptive data set summaries is found in the Hmisc (Harrell Miscellaneous) package for the R statistical programming environment. The describe function determines whether a variable is character, factor, category, binary, discrete numeric, or continuous numeric, and prints a concise statistical summary suited to each type.
Of note:
- For a binary variable, the sum (number of 1’s) and mean (proportion of 1’s) are printed.
- For any variable with at least 20 unique values, the 5 lowest and highest values are printed.
- A numeric variable is deemed discrete if it has <= 10 unique values. In this case, quantiles are not printed.
- A frequency table is printed for any non-binary variable if it has no more than 20 unique values.
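The rules above can be restated as a small base-R sketch. This is only an illustration of the thresholds listed here, not the code that Hmisc::describe runs; the function name `summarise_variable` is hypothetical, and "binary" is assumed to mean a numeric variable that takes exactly the values 0 and 1.

```r
# Illustrative only: a simplified restatement of the rules listed above,
# not the implementation used by Hmisc::describe.
summarise_variable <- function(x) {
  x <- x[!is.na(x)]
  n_unique <- length(unique(x))
  # Assumption: "binary" means numeric with exactly the values 0 and 1.
  is_binary <- is.numeric(x) && n_unique == 2 && all(x %in% c(0, 1))

  out <- list(n = length(x), distinct = n_unique)

  if (is_binary) {
    out$sum  <- sum(x)   # number of 1's
    out$mean <- mean(x)  # proportion of 1's
    return(out)
  }
  # Quantiles only for non-discrete numerics (more than 10 unique values).
  if (is.numeric(x) && n_unique > 10) {
    out$quantiles <- quantile(x, c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95))
  }
  # Frequency table when there are no more than 20 unique values.
  if (n_unique <= 20) {
    out$frequencies <- table(x)
  }
  # Extremes when there are at least 20 unique values.
  if (n_unique >= 20) {
    vals <- sort(unique(x))
    out$lowest  <- head(vals, 5)
    out$highest <- tail(vals, 5)
  }
  out
}

# Toy inputs, invented for illustration.
str(summarise_variable(c(1, 0, 1, 1, NA, 0)))  # binary flag
str(summarise_variable(1:25))                  # continuous-like numeric
str(summarise_variable(c("a", "b", "a")))      # character
```

In practice Hmisc::describe applies these decisions to every column of a data frame in one call; the sketch only makes the decision rules explicit for a single vector.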
“If I take it up I must understand every detail,” said he. “Take time to consider. The smallest point may be the most essential.” — Sherlock Holmes The Adventure of the Red Circle
For a couple of decades we have been loyal users of the Hmisc package in general, and of the describe function in particular, as a way to explore data before any analysis. As is often the case in the R ecosystem, there are numerous ways to accomplish this task (see the summarizing data blogs here and here for a dated yet extensive review). Our love affair with Hmisc::describe stemmed from its concise look in the pre-rmarkdown days (Sweave/LaTeX/PDF) and its integration with SAS formatted data sets (for example labels, formats, and special missing values). Indeed, in the clinical research industry, SAS, and especially SAS formatted data sets (SAS transport .xpt or native .sas7bdat files), remain widely used, although the language has lost its monopoly with the growing presence of R, in particular through R/Pharma. Dr. Frank Harrell, who developed the Hmisc package, has been a luminary from our perspective, as he laid out the possibilities of the R language, especially in the clinical research environment.
For some time we have wanted to modernize the describe function with an interactive interface that gives users the tools we often wished for after years of working with static (HTML and/or PDF) reports. We took the 2021 RStudio Table Contest as the opportunity to accomplish this goal, using the power of reactable combined with embedded plotly interactive figures within a flexdashboard to generate concise summaries of every variable in a data set with minimal user configuration. So that other users can take advantage of this summary table, we went a step further and wrapped our work into the describer package.
For this challenge we selected a subset of a CDISC (Clinical Data Interchange Standards Consortium) ADaM (Analysis Data Model) ADSL (Subject-Level Analysis Dataset) dataset as an illustration. The ADSL dataset has one record per subject and contains variables such as subject-level population flags, planned and actual treatment variables, demographic information, randomization factors, subgrouping variables, and important dates. The data originated from the PHUSE CDISC Pilot replication study.
The describer package provides an interface for the interactive table
describer consists of two main functions:
- describe_data(): creates a comprehensive tibble of variable metadata using Hmisc::describe as the engine
- describer(): creates an interactive table using Hmisc::describe + reactable.
describer(): The input to describer() is a tibble that is produced by describe_data().
The output of describer() is a reactable display with columns for variable number (NO), type of variable (TYPE), variable name and label (VARIABLE), number observed (OBSERVED), number and percent missing (MISSING), number of unique values (DISTINCT), and an interactive display (INTERACTIVE FIGURE).
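To make the displayed columns concrete, here is a minimal base-R sketch that computes comparable per-variable metadata. It is not the implementation of describe_data(), which uses Hmisc::describe as its engine; the helper `variable_overview`, the extra `LABEL` and `PCT_MISSING` column names, and the toy data are all assumptions made for illustration.

```r
# Hypothetical helper: builds a table resembling the columns shown by describer().
# It does not reproduce describe_data(), which relies on Hmisc::describe.
variable_overview <- function(data) {
  n_rows <- nrow(data)
  rows <- lapply(seq_along(data), function(i) {
    x <- data[[i]]
    label <- attr(x, "label", exact = TRUE)
    n_missing <- sum(is.na(x))
    data.frame(
      NO          = i,
      TYPE        = if (inherits(x, "Date")) "date" else if (is.numeric(x)) "numeric" else "character",
      VARIABLE    = names(data)[i],
      LABEL       = if (is.null(label)) NA_character_ else as.character(label),
      OBSERVED    = n_rows - n_missing,
      MISSING     = n_missing,
      PCT_MISSING = round(100 * n_missing / n_rows, 1),
      DISTINCT    = length(unique(x[!is.na(x)])),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy data, invented for illustration (not the ADSL dataset).
toy <- data.frame(
  USUBJID = c("01-001", "01-002", "01-003", "01-004"),
  AGE     = c(63, NA, 71, 63),
  TRTSDT  = as.Date(c("2014-01-02", "2012-08-05", NA, NA)),
  stringsAsFactors = FALSE
)
attr(toy$AGE, "label") <- "Age"

overview <- variable_overview(toy)
print(overview)
```

A table of this shape, one row per variable, is what makes it natural to render the summary with reactable and attach per-row details and figures.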
For each variable, there are additional dropdown details based on variable type (character, numeric, date), which are viewable by expanding that variable's row.
Built-in Interactivity:
- Search: Search the dataset variables by label
- Figures: Interactive figures are provided for each dataset variable dependent on variable type. Zoom and hover for more details.
Additional Interactivity:
- Filters: Filters can be created for any of the columns of the describe_data() output by adding crosstalk widgets and specifying a ‘SharedData’ object in the describer() function. Examples shown include subsetting by variable type and filtering based on % missing.
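The filtering pattern can be sketched with crosstalk and reactable directly. The text does not show how describer() receives the ‘SharedData’ object, so this example passes the shared object to reactable() instead of guessing at that argument. The metadata table and its column names are hypothetical stand-ins for the describe_data() output.

```r
library(crosstalk)
library(reactable)

# Hypothetical metadata table standing in for the output of describe_data();
# the column names and values are invented for this example.
meta <- data.frame(
  VARIABLE    = c("USUBJID", "AGE", "TRTSDT", "RACE"),
  TYPE        = c("character", "numeric", "date", "character"),
  PCT_MISSING = c(0, 2.5, 10, 0)
)

# Wrap the table so that the filter widgets and the table share one selection state.
shared_meta <- SharedData$new(meta)

# Filters in a narrow left column, linked table in the wide right column.
filtered_view <- bscols(
  widths = c(3, 9),
  list(
    filter_checkbox("type", "Variable type", shared_meta, ~TYPE),
    filter_slider("pct_missing", "% missing", shared_meta, ~PCT_MISSING)
  ),
  reactable(shared_meta)
)

# Print `filtered_view` in an R Markdown or flexdashboard chunk to render it.
```

Because both widgets and the table point at the same SharedData object, ticking a variable type or moving the slider narrows the rows shown without any server-side code.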